Abstract

Data movement across computational grids and across the memory hierarchy of individual grid machines is known to
be a limiting factor for applications involving large data sets. In this paper we introduce the Data Cube Operator on
an Arithmetic Data Set, which we call the Arithmetic Data Cube (ADC). We propose to use the ADC to benchmark grid
capabilities to handle large distributed data sets. The ADC stresses all levels of grid memory by producing $2^d$ views of
an Arithmetic Data Set of $d$-tuples described by a small number of parameters. We control the data intensity of the ADC
by controlling the sizes of the views through the choice of the tuple parameters.


1 Introduction 


The main object of data warehousing and On-Line Analytical Processing (OLAP), decision support database systems,
data mining systems and resource brokers is a data set characterized by a number $d$ of dimension attributes and
a measure attribute. The data set consists of tuples $(i_1, \dots, i_d, c)$. Each dimension attribute $i_j$ assumes values in some
range, say in an interval $[1, m_j - 1]$, and $c$ is a cost function (a measure) associated with the tuple $(i_1, \dots, i_d)$. The
goal of OLAP is to assist users in discovering patterns and anomalies in the data set by providing short query execution
times [13].

A standard tool of OLAP is the Data Cube Operator (DCO) [3], which computes views of the data set. For a chosen
subset of $k$ attributes, a view is a set of $k$-tuples containing only the chosen attributes, with the measures of
duplicates accumulated. If technically possible, the DCO computes $2^d$ views on all possible subsets of the dimensions. The
computed data cube reduces queries on multidimensional data to simple look-ups. There are approaches [1, 5] for mining
multi-dimensional association rules and answering iceberg queries by computing an iceberg cube, which contains only
aggregates above a certain threshold.
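The view computation described above can be illustrated with a minimal Python sketch. It is not the reference implementation: the function name `data_cube` and the toy rows are hypothetical, and summation is assumed as the way measures of duplicates are accumulated.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(tuples, d):
    """Return {attribute subset: {projected tuple: accumulated measure}}
    for all 2^d subsets of the d dimension attributes."""
    cube = {}
    for k in range(d + 1):
        for subset in combinations(range(d), k):
            view = defaultdict(int)
            for *dims, measure in tuples:
                key = tuple(dims[i] for i in subset)
                view[key] += measure  # assumption: measures are summed
            cube[subset] = dict(view)
    return cube

# Toy input: d = 2 dimension attributes followed by a measure.
rows = [(1, 1, 10), (1, 2, 5), (2, 1, 7), (1, 1, 3)]
cube = data_cube(rows, d=2)

print(len(cube))    # 4 views, one per subset of the 2 dimensions
print(cube[(0,)])   # {(1,): 18, (2,): 7}
```

This naive version rescans the input for every view; practical DCO algorithms derive each view from an already computed one.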

The input data sets and some of the materialized views often do not fit into the main computer memory, thus 
DCO computation requires a careful reuse of data loaded into the main memory (and all levels of cache). As a rule, 
computation of the DCO spills data across all levels of memory, making DCO especially interesting as a data intensive 
benchmark. 

A large number of papers are devoted to efficient computation of the DCO [6, 12, 18], and many companies have
proprietary algorithms for DCO computations. Some authors propose parallel DCO computation algorithms [8, 10].
To improve the efficiency of querying data cubes, a number of publications consider calculation and storage of data
cubes as condensed cubes [17] or as other highly compressed structures [14].

We are not trying to evaluate DCO algorithms here; instead, we are designing a benchmark for computational grids.
For the reference implementation we choose a greedy algorithm [6] that computes each view from its smallest parent
(a view having one more attribute). We assume that all attribute values are integers. Although real OLAP data sets
and existing OLAP benchmarks [11, 16] use mostly strings as attribute values, this is not a significant limitation,
since there are techniques such as hashing for mapping strings to integers. One of the advantages of using integers as
attribute values is a reduction in the size of the input data sets and materialized views.
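The smallest-parent rule can be sketched as follows. This is only an illustration of the selection step, not the algorithm of [6] in full; the function name `smallest_parent_plan` and the view sizes in `sizes` are hypothetical.

```python
from itertools import combinations

def smallest_parent_plan(view_sizes, d):
    """Map every proper subset of range(d) to the parent view
    (one more attribute) that has the fewest tuples."""
    full = tuple(range(d))
    plan = {}
    for k in range(d - 1, -1, -1):
        for view in combinations(full, k):
            parents = [tuple(sorted(view + (a,))) for a in full if a not in view]
            plan[view] = min(parents, key=lambda p: view_sizes[p])
    return plan

# Hypothetical view sizes for d = 3 dimensions.
sizes = {
    (0, 1, 2): 1000,
    (0, 1): 100, (0, 2): 500, (1, 2): 200,
    (0,): 10, (1,): 20, (2,): 50,
    (): 1,
}
plan = smallest_parent_plan(sizes, d=3)

print(plan[(2,)])  # (1, 2): 200 tuples, smaller than the 500 of (0, 2)
```

The plan only needs the view sizes, which is why a data set with sizes known in advance simplifies the implementation.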

Many data sets are available to test OLAP systems, DCO algorithms and data mining algorithms, for example, the
APB-1 and TPC-D benchmark databases [11, 16]. For benchmarking purposes the most appropriate choice is a synthetic data
set that can be generated by a small program, so that the data set is scalable, distribution of the benchmark
is manageable, and replication of the data set on the computational grid incurs a small overhead. Also, a
synthetic data set, like the data in many real applications, can be generated in a distributed fashion, which saves the effort of
splitting and distributing the data set.

In available synthetic data sets the tuples are randomly generated; however, there is no way to control the sizes of
the data views. In the next section we introduce the Arithmetic Data Set, which is similar to the randomly generated
data sets but has the advantage of a priori known view sizes. The latter simplifies the implementation of the
greedy DCO algorithm. For real or random data, one can estimate the view sizes by sampling or analytical methods
[6, 15].
2 The Arithmetic Data Set


The purpose of constructing an Arithmetic Data Set is to have a data set whose view sizes can be well controlled.
An Arithmetic Data Set $S$ is a subset of a group $Q$ defined by

$$Q = \bigotimes_{i=1}^{d} (\mathbb{Z}/m_i\mathbb{Z})^*,$$

where $(\mathbb{Z}/m_i\mathbb{Z})^*$ is the set of integers modulo $m_i$ coprime with $m_i$. An element of $S$ can be represented by a tuple
$x = (x_1, \dots, x_d)$. The subset $S$ is defined by a seed $s = (s_1, \dots, s_d) \in Q$, a generator $g = (g_1, \dots, g_d) \in Q$,
$s_i, g_i \neq 0$, $i = 1, \dots, d$, and the total number of elements $n$:

$$S = \bigcup_{j=0}^{n-1} \{ (s_1 g_1^j, \dots, s_d g_d^j) \}.$$

We choose $1 < g_i < m_i$ to be one of the $f_i = |(\mathbb{Z}/m_i\mathbb{Z})^*|$ numbers which are coprime with $m_i$. Let $q_i$ be the order
of $g_i$, that is, the smallest positive integer such that $g_i^{q_i} \equiv 1 \pmod{m_i}$. Since $g_i^j$ can assume at most $f_i$ different values,
$g_i^{f_i} \equiv 1 \pmod{m_i}$ and $q_i$ divides $f_i$. These tuples are different elements of $Q$ if LCM$^1(q_1, \dots, q_d) \geq n$; see Corollary
2.
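The construction above can be sketched as a small Python generator. The function name `arithmetic_data_set`, its `start` parameter, and the toy moduli are assumptions made for illustration; only the rule $x_i = s_i g_i^j \bmod m_i$ comes from the definition.

```python
def arithmetic_data_set(moduli, seed, gen, n, start=0):
    """Yield (s_1 g_1^j mod m_1, ..., s_d g_d^j mod m_d) for j = start, ..., start + n - 1."""
    # Jump directly to exponent `start`, so that separate chunks of the
    # data set can be generated independently of each other.
    x = [s * pow(g, start, m) % m for s, g, m in zip(seed, gen, moduli)]
    for _ in range(n):
        yield tuple(x)
        x = [xi * g % m for xi, g, m in zip(x, gen, moduli)]

# Toy parameters: 3 generates (Z/7Z)* (order 6) and 2 generates (Z/11Z)* (order 10).
moduli, seed, gen = (7, 11), (4, 6), (3, 2)
tuples = list(arithmetic_data_set(moduli, seed, gen, n=30))

print(tuples[:2])        # [(4, 6), (5, 1)]
print(len(set(tuples)))  # 30, since LCM(6, 10) = 30 >= n
```

Because each tuple depends only on its index $j$, a chunk such as `arithmetic_data_set(moduli, seed, gen, n=10, start=20)` reproduces `tuples[20:30]` without generating the first 20 tuples.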

Data Views. For any subset containing $k$ of the cube dimensions, $I = \{i_1, \dots, i_k\} \subset \{1, \dots, d\}$, the $I$-view of
$x \in Q$ is defined as the projection of $x$ on the face of the cube defined by $I$:

$$x_I = (x_{i_1}, \dots, x_{i_k}).$$

The $I$-view of $S$ comprises the $I$-views of all elements of $S$:

$$S_I = \{ x_I \mid x \in S \}.$$


View Sizes. For a given $I$-view we are interested in how many tuples there are in $S_I$. To do this we estimate
the multiplicity of a tuple $x \in S_I$, defined as the number of tuples of $S$ having the same $I$-view as $x$.

Two tuples $s_I g_I^j$ and $s_I g_I^k$ are the same iff $g_I^k = g_I^j$, or $g_I^{k-j} = 1_I$, considered as elements of $Q_I$. Hence, the
multiplicity $\mu$ of $s_I g_I^j$ can be calculated as follows:

$$\mu = |\{ 0 \leq k < n \mid k - j \equiv 0 \pmod{q_i},\ i \in I \}|.$$

Since the smallest positive solution $k - j$ of the system of congruences $k - j \equiv 0 \pmod{q_i}$, $i \in I$, is $\Lambda_I = \mathrm{LCM}_{i \in I}(q_i)$, we
find that $\lfloor n/\Lambda_I \rfloor \leq \mu \leq \lceil n/\Lambda_I \rceil$, which proves the following assertion.


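PROPOSITION 1. Let $\Lambda_I = \mathrm{LCM}_{i \in I}(q_i)$. The multiplicity $\mu$ of any tuple of an $I$-view of $S$ can be estimated as

$$\left\lfloor \frac{n}{\Lambda_I} \right\rfloor \leq \mu \leq \left\lceil \frac{n}{\Lambda_I} \right\rceil.$$

If $\Lambda_I \geq n$, then the second inequality of the proposition implies that the multiplicity of each element of $S_I$ is 1, hence
$|S_I| = n$. Obviously, $|S_I| \leq \Lambda_I$. Hence, we have the following formula for $|S_I|$:


COROLLARY 2. For the size of an $I$-view of $S$ we have the following relation:

$$|S_I| = \begin{cases} n & \text{if } n \leq \Lambda_I; \\ \Lambda_I & \text{otherwise.} \end{cases}$$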
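Corollary 2 can be checked by brute force on a small example. The sketch below assumes Python 3.9 or later (for `math.lcm` with several arguments); the helper `view_size` and the toy moduli, seeds and generators are hypothetical and are not the benchmark parameters.

```python
from itertools import combinations
from math import lcm

def view_size(n, orders, view):
    """Corollary 2: |S_I| = min(n, LCM of q_i over i in I)."""
    return min(n, lcm(*(orders[i] for i in view)))

# Toy data set: 3, 2 and 2 generate (Z/7Z)*, (Z/11Z)* and (Z/13Z)*,
# so the orders q_i are 6, 10 and 12.
moduli, seed, gen, orders = (7, 11, 13), (4, 6, 7), (3, 2, 2), (6, 10, 12)
n = 40
data, x = [], list(seed)
for _ in range(n):
    data.append(tuple(x))
    x = [xi * g % m for xi, g, m in zip(x, gen, moduli)]

# Compare the counted number of distinct projections with the formula.
for k in range(1, 4):
    for view in combinations(range(3), k):
        counted = len({tuple(t[i] for i in view) for t in data})
        assert counted == view_size(n, orders, view)

print(view_size(n, orders, (0, 2)))  # 12, because LCM(6, 12) = 12 < n
print(view_size(n, orders, (1, 2)))  # 40, because LCM(10, 12) = 60 >= n
```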


1 LCM stands for the Least Common Multiple.
3 Choice of the Parameters

To illustrate a possible choice of the parameters for the grid benchmarks, we choose the $m_i$ to be prime numbers and
the $g_i$ to be generators of $(\mathbb{Z}/m_i\mathbb{Z})^*$, hence having period $q_i = f_i = m_i - 1$. Also, we choose the $m_i$ such that $m_i - 1$ has
many small prime factors, so that $\Lambda_I$ has a good chance of being small. This approach gives us good control over the
sizes of the data set and its views. Our actual choice of the $m_i$ is shown in Table 1.

We choose 4 groups of the smallest prime numbers: $\{3, 5, 7\}$, $\{11, 13, 17, 19\}$, $\{23, 29, 31, 37\}$, and $\{41, 43, 47, 53, 59\}$.
For each group we choose the 5 smallest primes $m_i$ such that the prime factors of $m_i - 1$ are 2 and numbers from this
group$^2$. This set of parameters gives us a data set of
$2^5 \cdot 3^2 \cdot 5^2 \cdot 7^2 \cdot 11 \cdot 13 \cdot 17 \cdot 19^2 \cdot 23 \cdot 29 \cdot 31^2 \cdot 37 \cdot 41 \cdot 43 \cdot 47 \cdot 53 \cdot 59$
different tuples and, for example, we can choose $n = 2 \cdot 11 \cdot 23 \cdot 41 \cdot 3 \cdot 13 \cdot 29 \cdot 43 \cdot 5 \cdot 17 = 85759918530$. At the
same time, the 5-dimensional view relative to each of the groups is small relative to the total number of elements
in the data set.
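The quantities in this section can be recomputed with a few lines of Python. The sketch below is illustrative only: the helper names `prime_factors`, `is_generator` and `least_generator` are hypothetical, trial division is assumed to be adequate for moduli of this size, and `math.lcm` with several arguments requires Python 3.9 or later.

```python
from math import lcm, prod

def prime_factors(k):
    """Distinct prime factors of k, found by trial division."""
    factors, p = set(), 2
    while p * p <= k:
        while k % p == 0:
            factors.add(p)
            k //= p
        p += 1
    if k > 1:
        factors.add(k)
    return factors

def is_generator(g, m):
    """For a prime m, g generates (Z/mZ)* iff g^((m-1)/p) != 1 mod m
    for every prime p dividing m - 1."""
    return all(pow(g, (m - 1) // p, m) != 1 for p in prime_factors(m - 1))

def least_generator(m):
    return next(g for g in range(2, m) if is_generator(g, m))

# First group of primes from Table 1.
group1 = [421, 601, 631, 701, 883]
group1_lcm = lcm(*(m - 1 for m in group1))

# The example value of n quoted in the text.
n = prod([2, 11, 23, 41, 3, 13, 29, 43, 5, 17])

print(prime_factors(421 - 1))  # {2, 3, 5, 7}
print(least_generator(421))    # 2
print(pow(2, 11, 421))         # 364, the chosen generator for m = 421
print(group1_lcm)              # 88200
print(n)                       # 85759918530
```

Raising the least generator to an exponent coprime with $m_i - 1$, as with the exponent 11 here, yields another generator of the same group.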


Table 1. Dimensions of the Arithmetic Data Cube and their factorizations. Here "Least Generator" $\gamma_i$ is
the smallest generator of $(\mathbb{Z}/m_i\mathbb{Z})^*$, "Generator" is the chosen generator $g_i$ of $(\mathbb{Z}/m_i\mathbb{Z})^*$, and "Exp"
is $e_i$ such that $g_i = \gamma_i^{e_i}$.


No.  Prime   Factorization of m - 1   Least Generator   Exp   Generator   Seed (m + 1)/2
---  -----   ----------------------   ---------------   ---   ---------   --------------
1.   421     2^2 · 3 · 5 · 7          2                 11    364         211
2.   601     2^3 · 3 · 5^2            7                 13    412         301
3.   631     2 · 3^2 · 5 · 7          3                 17    334         316
4.   701     2^2 · 5^2 · 7            2                 19    641         351
5.   883     2 · 3^2 · 7^2            2                 23    108         442
     LCM     2^3 · 3^2 · 5^2 · 7^2 = 88200

6.   419     2 · 11 · 19              2                 23    228         210
7.   443     2 · 13 · 17              2                 29    98          222
8.   647     2 · 17 · 19              5                 31    94          324
9.   21737   2^3 · 11 · 13 · 19       31                37    8280        10869
10.  31769   2^3 · 11 · 19^2          7                 41    26667       15885
     LCM     2^3 · 11 · 13 · 17 · 19^2 = 7020728

11.  1427    2 · 23 · 31^2            2                 41    595         714
12.  18353   2^4 · 31 · 37            3                 43    8397        9177
13.  22817   2^5 · 23 · 31            3                 47    15046       11409
14.  34337   2^5 · 29 · 37            3                 53    15699       17169
15.  98717   2^2 · 23 · 29 · 37       2                 59    62206       49359
     LCM     2^5 · 23 · 29 · 31^2 · 37 = 758228608

16.  3527    2 · 41 · 43              5                 3     125         1764
17.  8693    2^2 · 41 · 53            3                 5     443         4347
18.  9677    2^2 · 41 · 59            2                 7     128         4839
19.  11093   2^2 · 47 · 59            2                 11    2048        5547
20.  18233   2^3 · 43 · 53            3                 13    8052        9117
     LCM     2^3 · 41 · 43 · 47 · 53 · 59 = 2072850776





4 Air Traffic Control Example 

We illustrate the Data Cube Operator for an example of Air Traffic Control (ATC) data [9]. Each of about 20
national ATC Centers obtains flight data from airports and radars in real time. Typical records are shown in Table 2, and
a typical query is as follows:


2 Since we use odd primes, $m_i - 1$ always has 2 as a factor.





Find AC type 

where Busy = 1 

and ETA is Between 1105 and 1110 
and destination is CLE 

The queries should be executed in real time and can be posted at any of the centers, implying that the flight data
must be communicated between the centers. One possible way to ensure a short query response time is to replicate
the Data Cube across the centers. This constitutes an example of a distributed dynamic DCO which requires real-time
query response and DCO update. The ATC example can be extended to a satellite and spacecraft control
system.


Table 2. Air Traffic Control Data. Typical Query: Find AC type where Busy = 1 and ETA is 
Between 1105 and 1110 and destination is CLE. 


Flight ID   AC type   ETA    Destination   Controller   Busy
---------   -------   ----   -----------   ----------   ----
UAL 147     747       1100   CLE           17           1
NW 1186     767       1132   ORD           26           1
KLM 761     747       1105   CLE           8            1
AA 2345     A320      1135   ORD           17           1
UAL 258     737       1112   CLE           9            1
AA 2744     737       1105   CAK           11           1
SW 377      767       1108   CLE           87           1
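The typical query can be made concrete with a short Python sketch over the records of Table 2. The `Flight` type and the function `ac_types` are hypothetical names, the Busy flag is taken as 1 for every record (an assumption), and the linear scan stands in for the look-up that a materialized data cube would provide.

```python
from typing import NamedTuple

class Flight(NamedTuple):
    flight_id: str
    ac_type: str
    eta: int
    destination: str
    controller: int
    busy: int

# Records transcribed from Table 2 (Busy assumed to be 1 throughout).
flights = [
    Flight("UAL 147", "747", 1100, "CLE", 17, 1),
    Flight("NW 1186", "767", 1132, "ORD", 26, 1),
    Flight("KLM 761", "747", 1105, "CLE", 8, 1),
    Flight("AA 2345", "A320", 1135, "ORD", 17, 1),
    Flight("UAL 258", "737", 1112, "CLE", 9, 1),
    Flight("AA 2744", "737", 1105, "CAK", 11, 1),
    Flight("SW 377", "767", 1108, "CLE", 87, 1),
]

def ac_types(records, eta_from, eta_to, destination):
    """Find AC type where Busy = 1, ETA is between the bounds (inclusive)
    and the destination matches."""
    return [r.ac_type for r in records
            if r.busy == 1
            and eta_from <= r.eta <= eta_to
            and r.destination == destination]

result = ac_types(flights, 1105, 1110, "CLE")
print(result)  # ['747', '767']
```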


5 Related Work 

The benchmarking of data mining systems is a well-established area of High Performance Computing [11, 16].
These benchmarks are designed to compare the performance of query systems running on a server. On the other hand, a
number of benchmarks have been designed for testing computational grids [2, 11, 16]. The grid benchmarking effort
is currently supported by the Grid Benchmarking Research Group at the Global Grid Forum. These benchmarks are
mostly computationally intensive and are derived from the NAS Parallel Benchmarks. We propose the Arithmetic Data
Cube (ADC) as a data intensive grid benchmark which extends typical data mining operations into a grid environment.

6 Conclusions 

We show that the ADC represents an important set of computations in OLAP and data mining. We give an example
of a dynamic real-time system performing the set of operations specified in the ADC.

The ADC is data intensive since 

• it mostly involves logical operations 

• the size of the output data set can significantly exceed the size of the original data set 

• existing algorithms perform few operations per tuple per memory access (and are similar to the merge in this 
respect) 

The advantages of ADC as a grid benchmark are that 

• it is described by a small number of parameters and has a priori known sizes of the views 

• the views can be generated independently 

• the overhead of combining the views is predictable 
• the data set can be partitioned into a number of independently generated subsets

• the elements of the data set are pseudo random 

Together, these properties make the ADC a strong candidate for a data intensive grid benchmark to be considered by the
Global Grid Forum Grid Benchmarking Research Group (GB-RG) [4].